Estimating user location in a system including smart audio devices

ABSTRACT

Methods and systems for performing at least one audio activity (e.g., conducting a phone call or playing music or other audio content) in an environment including by determining an estimated location of a user in the environment in response to sound uttered by the user (e.g., a voice command), and controlling the audio activity in response to determining the estimated user location. The environment may have zones which are indicated by a zone map and estimation of the user location may include estimating in which of the zones the user is located. The audio activity may be performed using microphones and loudspeakers which are implemented in or coupled to smart audio devices.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 62/880,118, filed Jul. 30, 2019, and entitled Coordination of Smart Audio Devices.

TECHNICAL FIELD

The invention pertains to systems and methods for coordinating (orchestrating) and implementing audio devices (e.g., smart audio devices), and to tracking user location in response to sound which is uttered by a user and detected by microphone(s) of a system including audio devices (e.g., smart audio devices).

BACKGROUND

At present, designers consider audio devices as a single point of interface for audio that may be a blend of entertainment, communications and information services. Using audio for notifications and voice control has the advantage of avoiding visual or physical intrusion. The expanding device landscape is fragmented with more systems competing for our one pair of ears. With wearable augmented audio starting to become available, things do not seem to be converging towards enabling the ideal pervasive audio personal assistant, and it has not been possible to use the multitude of devices around us for seamless capture, connectivity and communications.

It would be useful to develop methods and systems to bridge devices (e.g., smart audio devices), and better manage location, context, content, timing and user preference. Together, a set of standards, infrastructure and APIs could enable consolidated access to a user's environment (e.g., an audio space around a user). We contemplate methods and systems which manage basic audio input and output and allow connectivity of audio devices (e.g., smart audio devices) to implement particular activities (e.g., applications implemented by the systems or smart audio devices thereof).

BRIEF DESCRIPTION OF EMBODIMENTS

A class of embodiments provides a method and system in which a plurality of audio devices (e.g., smart audio devices) are coordinated, including by estimating (and typically also tracking) user location in response to sound which is uttered by a user and detected by microphone(s). Each microphone is included in a system which includes audio devices (e.g., smart audio devices), and typically, at least some microphones (of a plurality of microphones) are implemented in (or coupled to) smart audio devices of the system.

Some embodiments of the inventive method include performing (and some embodiments of the inventive system are configured to perform) at least one audio activity. Herein, an audio activity is an activity which includes detection of sound (using at least one microphone) and/or production of sound (by emitting the sound from at least one loudspeaker). Examples of audio activities include (but are not limited to) conducting a phone call (e.g., using at least one smart audio device), or playing music or other audio content (e.g., using at least one smart audio device) while detecting sound using at least one microphone (e.g., of at least one smart audio device). Some embodiments of the inventive method include controlling (and some embodiments of the inventive system are configured to control) at least one audio activity. Such control of an audio activity may occur together with (or contemporaneously with) performance and/or control of at least one video activity (e.g., displaying video), and each video activity may be controlled together with (or contemporaneously with) control of at least one audio activity.

In some embodiments, the method includes steps of:

performing at least one audio activity using a loudspeaker set of a system implemented in an environment, where the system includes at least two microphones and at least two loudspeakers, and the loudspeaker set includes at least one of the loudspeakers;

determining an estimated location of a user in the environment in response to sound uttered by the user (e.g., a voice command, or speech which is not a voice command), wherein the sound uttered by the user is detected by at least one of the microphones of the system; and

controlling the audio activity in response to determining the estimated location of the user, including by at least one of:

controlling at least one setting or state of the loudspeaker set; or

causing the audio activity to be performed using a modified loudspeaker set, where the modified loudspeaker set includes at least one loudspeaker of the system, but where the modified loudspeaker set is different than the loudspeaker set.

Typically, at least some of the microphones and at least some of the loudspeakers of the system are implemented in (or coupled to) smart audio devices.

In some embodiments, the method includes steps of: performing at least one audio activity using a transducer set of a system, where the transducer set includes at least one microphone and at least one loudspeaker, the system is implemented in an environment having zones, and the zones are indicated by a zone map; and determining an estimated location of a user in response to sound uttered by the user (e.g., a voice command, or speech which is not a voice command), including by detecting the sound uttered by the user using at least one microphone of the system, and estimating in which of the zones the user is located. Typically, the system includes microphones and loudspeakers, and at least some of the microphones and at least some of the loudspeakers are implemented in (or coupled to) smart audio devices. Typically also, the method includes a step of: controlling the audio activity in response to determining the estimated location of the user, including by at least one of: controlling at least one setting or state of the transducer set (e.g., of at least one microphone and/or at least one loudspeaker of the transducer set); or causing the audio activity to be performed using a modified transducer set, where the modified transducer set includes at least one microphone and at least one loudspeaker of the system, but where the modified transducer set is different than the transducer set.

In some embodiments of the inventive method, a step of controlling at least one audio activity is performed in response to both determining the estimated location of the user and at least one learned experience (e.g., learned preferences of a user). For example, such an audio activity may be controlled in response to data indicative of at least one learned experience, where the learned experience has been determined (e.g., by a learning module of an embodiment of the inventive system) from at least one previous activity (which occurred before the controlling step). For example, the learned experience may have been determined from previous user commands (e.g., voice commands) asserted under conditions which were the same or similar to those present during the current audio activity, and the controlling step may be performed in accordance with a probabilistic confidence based on data indicative of the learned experience.

In some embodiments, a system comprising coordinated multiple smart audio devices is configured to track the location of a user within a home or other environment (e.g., within zones of the environment), and determine a best set of loudspeaker(s) and microphone(s) of the system, given the current location of the user (e.g., the zone in which the user is currently located), for implementing the current audio activity (or activities) being or to be performed by the system. Tracking of the user location may be performed in response to sound uttered by the user (e.g., a voice command) and detected by at least one microphone (e.g., two or more microphones) of the system. Examples of the audio activities include (but are not limited to) implementing a phone call, watching a movie, listening to music, and listening to a podcast. The system may be configured to respond to a change in location of the user (e.g., movement of the user from one zone to another zone) including by determining a new (updated) best set of loudspeaker(s) and microphone(s) for the activity or activities.

Aspects of the invention include a system configured (e.g., programmed) to perform any embodiment of the inventive method or steps thereof, and a tangible, non-transitory, computer readable medium (for example, a disc or other tangible storage medium) which stores in a non-transitory manner code for performing (e.g., code executable to perform) any embodiment of the inventive method or steps thereof. For example, embodiments of the inventive system (or one or more elements thereof) can be or include a programmable general purpose processor, digital signal processor, or microprocessor (e.g., included in a smart phone or other smart audio device), programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the inventive method or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform an embodiment of the inventive method (or steps thereof) in response to data asserted thereto.

NOTATION AND NOMENCLATURE

Throughout this disclosure, including in the claims, “speaker” and “loudspeaker” are used synonymously to denote any sound-emitting transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), all driven by a single, common speaker feed (the speaker feed may undergo different processing in different circuitry branches coupled to the different transducers).

Throughout this disclosure, including in the claims, “wakeword” is used in a broad sense to denote any sound (e.g., a word uttered by a human, or some other sound), where a smart audio device is configured to awake in response to detection of (“hearing”) the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, to “awake” denotes that the device enters a state in which it awaits (i.e., is listening for) a sound command.

Throughout this disclosure, including in the claims, the expression “wakeword detector” denotes a device configured (or software that includes instructions for configuring a device) to search continuously for alignment between real-time sound (e.g., speech) features and a trained model. Typically, a wakeword event is triggered whenever it is determined by a wakeword detector that the probability that a wakeword has been detected exceeds a predefined threshold. For example, the threshold may be a predetermined threshold which is tuned to give a good compromise between rates of false acceptance and false rejection. Following a wakeword event, a device might enter a state (which may be referred to as an “awakened” state or a state of “attentiveness”) in which it listens for a command and passes on a received command to a larger, more computationally-intensive recognizer.
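The thresholding described above can be illustrated with a minimal Python sketch. The function name `wakeword_events`, the default threshold of 0.8, and the per-frame probabilities are hypothetical and chosen only for illustration; the disclosure does not specify how the detection probability is computed or which threshold value is used.

```python
from typing import Iterable, List


def wakeword_events(probabilities: Iterable[float], threshold: float = 0.8) -> List[int]:
    """Return the indices of frames at which a wakeword event is triggered.

    An event is triggered whenever the detector's probability that a
    wakeword has been detected exceeds the predefined threshold.
    The threshold value is an assumption, not taken from the disclosure.
    """
    return [i for i, p in enumerate(probabilities) if p > threshold]
```

Raising the threshold in such a sketch would lower the rate of false acceptance at the cost of a higher rate of false rejection, which is the compromise the threshold is tuned to balance.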

Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).

Throughout this disclosure, including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X−M inputs are received from an external source) may also be referred to as a decoder system.

Throughout this disclosure, including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.

Throughout this disclosure, including in the claims, the term “couples” or “coupled” is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a diagram of a system which can be implemented in accordance with some embodiments of the invention.

FIG. 1B is a diagram of a system which can be implemented in accordance with some embodiments of the invention.

FIG. 2 is a block diagram of a system implemented in accordance with an embodiment of the invention.

FIG. 3 is a block diagram of an exemplary embodiment of module 201 of FIG. 2.

FIG. 4 is a block diagram of another exemplary embodiment of module 201 of FIG. 2.

FIG. 5 is a block diagram of a system implemented in accordance with another embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

Many embodiments of the present invention are technologically possible. It will be apparent to those of ordinary skill in the art from the present disclosure how to implement them. Some embodiments of the inventive system and method are described herein.

Examples of devices, which implement audio input, output and/or real time interaction, and are included in some embodiments of the inventive system, include (but are not limited to) wearables, home audio, mobile devices, automotive and mobile computing devices, and smart speakers. A smart speaker may include a network connected speaker and microphone for cloud based services. Other examples of devices, which are included in some embodiments of the inventive system, include (but are not limited to) speakers, microphones, and devices which include speaker(s) and/or microphone(s) (e.g., lights, clocks, personal assistant devices, and/or rubbish bins).

Herein, we use the expression “smart audio device” to denote a smart device which is either a single purpose audio device or a virtual assistant (e.g., a connected virtual assistant). A single purpose audio device is a device (e.g., a TV or a mobile phone) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker), and/or at least one speaker (and optionally also including or coupled to at least one microphone), and which is designed largely or primarily to achieve a single purpose. Although a TV typically can play (and is thought of as being capable of playing) audio from program material, in most instances a modern TV runs some operating system on which applications run locally, including the application of watching television. Similarly, the audio input and output in a mobile phone may do many things, but these are serviced by the applications running on the phone. In this sense, a single purpose audio device having speaker(s) and microphone(s) is often configured to run a local application and/or service to use the speaker(s) and microphone(s) directly. Some single purpose audio devices may be configured to group together to achieve playing of audio over a zone or user configured area.

A virtual assistant (e.g., a connected virtual assistant) is a device (e.g., a smart speaker or voice assistant integrated device) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker) and which may provide an ability to utilize multiple devices (distinct from the virtual assistant) for applications that are in a sense cloud enabled or otherwise not implemented in or on the virtual assistant itself. Virtual assistants may sometimes work together, e.g., in a very discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them, i.e., the one which is most confident that it has heard a wakeword, responds to the word. Connected devices may form a sort of constellation, which may be managed by one main application which may be (or implement) a virtual assistant.

Although the categories of single purpose audio device and virtual assistant are not strictly orthogonal, the speaker(s) and microphone(s) of an audio device (e.g., a smart audio device) may be assigned to functions that are either enabled by or attached to (or implemented by) a smart audio device. However, there is typically not a sense in which the audio device's speaker(s) and/or microphone(s), considered individually (distinct from the audio device), may be added to a collective.

In some embodiments, an orchestrated system is or includes multiple smart audio devices (and optionally also video devices). The system (and/or one or more devices thereof) is configured to implement (and perform) at least one application, including by tracking user location and selecting the best loudspeaker(s) and microphone(s) for the application. For example, the application may be or include conducting a telephone call, or listening to music or a podcast. In the case of a phone call, the application may involve selecting appropriate microphones and loudspeakers from (i.e., included in or coupled to) a known set of available audio devices based on the location of the user. In some embodiments, the user's location is determined from a voice command (or at least one other user utterance) and/or from an electronic locating beacon (e.g., using Bluetooth technology). In some embodiments, once the best microphone(s) and loudspeaker(s) are chosen, and the user then moves, a new set of best microphone(s) and loudspeaker(s) is determined for the new user location.

Each of FIGS. 1A and 1B is a diagram of a system which can be implemented in accordance with some embodiments of the invention. FIG. 1B differs from FIG. 1A in that the position 101 of the user in FIG. 1A is different than the position 113 of the user in FIG. 1B.

In FIGS. 1A and 1B, the labeled elements are:

107: Zone 1;

112: Zone 2;

101: user (talker) position, in zone 1;

102: Direct local voice (uttered by the user);

103: Plurality of loudspeakers in smart audio device (e.g., voice assistant device) located in zone 1;

104: Plurality of microphones in smart audio device (e.g., voice assistant device) located in zone 1;

105: Household appliance located in zone 1, e.g. a lamp;

106: Plurality of microphones in a household appliance located in zone 1;

113: user (talker) position, in zone 2;

108: Plurality of loudspeakers in smart audio device (e.g., voice assistant device) located in zone 2;

109: Plurality of microphones in smart audio device (e.g., voice assistant device) located in zone 2;

110: Household appliance (e.g., refrigerator) located in zone 2; and

111: Plurality of microphones in a household appliance located in zone 2.

FIG. 2 is a block diagram of a system implemented in an environment (e.g., a home) in accordance with an embodiment of the invention. The system implements a “follow me” mechanism to track user location. In FIG. 2, the labeled elements are:

201: subsystem (sometimes referred to as a module or “follow me” module) configured to take inputs and make decisions (in response to the inputs) about best microphones and loudspeakers to use for a determined activity (e.g., indicated by input 206A);

201A: data indicative of a decision (determined in module 201) as to best loudspeaker(s) of the system to use for the determined activity, and/or the zone (i.e., one of the zones indicated by zone map 203) in which a user (e.g., a talker) is currently located;

201B: data indicative of a decision (determined in module 201) as to best microphone(s) of the system to use for the determined activity, and/or the zone (i.e., one of the zones indicated by zone map 203) in which the user is currently located;

202: User location subsystem (module), configured to determine location of the user (e.g., a talker, for example, the user of FIG. 1A or 1B), e.g., within a zone of the environment. In some embodiments, subsystem 202 is configured to estimate a user's zone (e.g., according to multiple acoustic features derived from at least some of the microphones 205). In some such embodiments, the goal is not to estimate the user's exact geometric location but to form a robust estimate of a discrete zone (e.g., in the presence of heavy noise and residual echo) in which the user is located;

202A: Information (data) indicative of current location of the user (talker), determined by module 202 and asserted to module 201;

203: Zone map subsystem, providing a zone map indicative of zones of the environment of the system (e.g., the zones of FIGS. 1A and 1B, if the system is in the environment of FIGS. 1A and 1B), and a list of all microphones and loudspeakers of the system grouped by their locations in the zones. In some implementations, subsystem 203 is or includes a memory which stores data indicative of the zone map;

203A: Information (data) about at least one zone (of the zone map) and the plurality of microphones and loudspeakers contained in each such zone (e.g., each of at least a subset of the zones) of the zone map, asserted (in some implementations of the system) to module 201 and/or to module 202;

204: Preprocessing subsystem, coupled and configured to perform preprocessing of the outputs of microphones 205. Subsystem 204 may implement one or more microphone preprocessing subsystems (e.g., an echo management subsystem, a wakeword detector, and/or a speech recognition subsystem, etc.);

204A: preprocessed microphone signal(s) generated by, and output from, subsystem 204;

205: Plurality of microphones (e.g., including microphones 104, 106, 109, and 111 of FIGS. 1A and 1B);

206: subsystem coupled and configured to implement at least one current audio activity (e.g., a plurality of current ongoing audio activities). Each such audio activity (sometimes referred to herein as an “activity” for convenience) includes detection of sound (using at least one microphone) and/or production of sound (by emitting the sound from at least one loudspeaker). Examples of such audio activities include, but are not limited to, music playback (e.g., including a step of providing audio for rendering by subsystem 207), a podcast (e.g., including a step of providing audio for rendering by subsystem 207), and/or a phone call (e.g., including providing teleconference audio for rendering by subsystem 207, and processing and/or transmitting each microphone signal provided to subsystem 204);

206A: Information (data) about the current ongoing activity or activities implemented by subsystem 206, which is generated by subsystem 206 and asserted from subsystem 206 to module 201;

207: Multichannel loudspeaker renderer subsystem, coupled and configured to render audio (e.g., by generating speaker feeds for driving speakers 208) generated or otherwise provided during performance of at least one current activity of the system. For example, subsystem 207 may be implemented to render audio for playback by a subset of speakers 208 (which may be implemented in or coupled to different smart audio devices) such that sound emitted by the relevant loudspeakers is perceivable (e.g., clearly, or in a best or desired manner) by the user, in the user's current location (e.g., zone), in accordance with data 201A;

208: Plurality of loudspeakers (e.g., including 103 and 108 of FIGS. 1A and 1B); and

401: voice command(s) from a user (e.g., a talker, for example, the user of FIG. 1A or 1B), which is or are output from subsystem 204 and provided to module 201 in typical implementations of the system.

Elements 201, 202, and 203 (or elements 202 and 203) may be referred to collectively as a user location and activity control subsystem of the FIG. 2 system.

Elements of the FIG. 2 system (and some other embodiments of the invention) may be implemented in or coupled to smart audio devices. For example, all or some of loudspeakers 208 and/or all or some of microphones 205 may be implemented in or coupled to one or more smart audio devices, or at least some of the microphones and loudspeakers may be implemented in a Bluetooth device connected to a Bluetooth transmitter/receiver (e.g., smartphone). Also for example, one or more other elements (e.g., all or some of elements 201, 202, 203, 204, and 206) of the FIG. 2 system (and/or all or some of elements 201, 202, 203, 204, 206, and 211 of the FIG. 5 system to be described below) may be implemented in or coupled to smart audio devices. In such example embodiments, the “follow me” module 201 operates (and other system elements operate) to coordinate (orchestrate) the smart audio devices, by tracking user location in response to sound uttered by a user and detected by at least one microphone of the system. For example, such coordination includes coordination of rendering of sound to be emitted by element(s) of the system and/or processing of the output(s) of microphone(s) of the system, and/or at least one activity implemented by the system (e.g., by element 206 of the system, for example by controlling activity manager 211 of FIG. 5 or another activity manager of the system).

Typically, subsystems 202 and 203 are tightly integrated. Subsystem 202 may receive outputs of all or some (e.g., two or more) of microphones 205 (e.g., implemented as asynchronous microphones). Subsystem 202 may implement a classifier, which in some examples is implemented in a smart audio device of the system. In other examples, the classifier may be implemented by another type of device (e.g., a smart device which is not configured to provide audio) of the system which is coupled and configured for communication with the microphones. For example, at least some of microphones 205 may be discrete microphones (e.g., in household appliances) which are not included in any smart audio device but which are configured for communication with a device which implements subsystem 202 as a classifier, and the classifier may be configured to estimate a user's zone according to multiple acoustic features derived from the output signals of each microphone. In some such embodiments, the goal is not to estimate the user's exact geometric location but to form a robust estimate of a discrete zone (e.g., in the presence of heavy noise and residual echo).

Herein, the expression “geometric location” (referred to in the previous and the following description) of an object, or a user, or a talker, in an environment, refers to a location based on a coordinate system (e.g., a coordinate system with reference to GPS coordinates), with reference to the system environment as a whole (e.g., according to a Cartesian or polar coordinate system having its origin somewhere within the environment) or with reference to a particular device (e.g., a smart audio device) within the environment (e.g., according to a Cartesian or polar coordinate system having the device as its origin). In some implementations, subsystem 202 is configured to determine an estimate of a user's location in the environment without reference to geometric locations of microphones 205.

“Follow me” module 201 is coupled and configured to operate in response to a number of inputs (one or more of 202A, 203A, 206A, and 401), and to produce one or both of outputs 201A and 201B. Examples of the inputs are next described in more detail.

Input 203A may be indicative of information regarding each zone of the zone map (sometimes referred to as acoustic zones), including but not limited to one or more of: a list of devices (e.g., smart devices, microphones, loudspeakers, etc.) of the system located within each zone, dimension(s) of each zone (e.g., in same coordinate system as geometric location units), geometric location of each zone (e.g., Kitchen, Living Room, Bedroom, etc.) with respect to the environment and/or with respect to other zones, geometric location of each device of the system (e.g., with respect to their respective zones and/or with respect to other ones of the devices), and/or name of each zone.
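As a minimal sketch of the kind of data that a zone map (and input 203A) might carry, the following Python fragment groups microphones and loudspeakers by zone. The class name `Zone`, the field names, and the helper `devices_in_zone` are hypothetical; the device identifiers reuse the reference numerals of FIGS. 1A and 1B purely for illustration, and the disclosure does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Zone:
    """One acoustic zone of the zone map (hypothetical layout)."""
    name: str
    microphones: List[str] = field(default_factory=list)
    loudspeakers: List[str] = field(default_factory=list)


# Hypothetical zone map mirroring FIGS. 1A and 1B: microphones 104 and 106
# and loudspeakers 103 are in zone 1; microphones 109 and 111 and
# loudspeakers 108 are in zone 2.
zone_map: Dict[str, Zone] = {
    "zone 1": Zone("zone 1", microphones=["104", "106"], loudspeakers=["103"]),
    "zone 2": Zone("zone 2", microphones=["109", "111"], loudspeakers=["108"]),
}


def devices_in_zone(zmap: Dict[str, Zone], zone_name: str) -> Tuple[List[str], List[str]]:
    """Return (microphones, loudspeakers) listed for the named zone."""
    zone = zmap[zone_name]
    return zone.microphones, zone.loudspeakers
```

Optional items mentioned above, such as zone dimensions or geometric locations of devices, could be added as further fields in such a structure.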

Input 202A may be or include real time information (data) regarding all or some of: the acoustic zone in which the user (talker) is located, the talker's geometric location within such zone, and how long the talker has been in such zone. Input 202A may also include a degree of confidence by user location module 202 as to the accuracy or correctness of any of the information noted in the previous sentence, and/or a history of talker movement (e.g., within the past N hours, where the parameter N is configurable).

Input 401 may be a voice command, or two or more voice commands, uttered by the user (talker), each of which has been detected by preprocessing subsystem 204 (e.g., commands related or unrelated to the functionality of “follow me” module 201).

Output 201A of module 201 is an instruction to rendering subsystem (renderer) 207 to adapt processing according to the current (e.g., most recently determined) acoustic zone of the talker. Output 201B of module 201 is an instruction to preprocessing subsystem 204 to adapt processing according to the current (e.g., most recently determined) acoustic zone of the talker.

Output 201A may be indicative of the talker's geometric location with respect to the talker's current acoustic zone, as well as geometric location and distance of each of loudspeakers 208 with respect to the talker, e.g., to cause renderer 207 to perform rendering in the best way possible for the relevant activity being implemented by the system. The best way possible may depend on the activity and the zone, and optionally also on the talker's previously determined (e.g., recorded) preferences. For example, if the activity is a movie, and the talker is in the living room, output 201A may instruct renderer 207 to play back the audio of the movie using as many loudspeakers as possible for a cinema-like experience. If the activity is music, or a podcast, and the talker is in the kitchen, or in the bedroom, output 201A may instruct renderer 207 to render the music with only the closest loudspeakers, for a more intimate experience.
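The two examples above can be sketched in Python as a simple selection rule. This is an illustrative sketch only: the function name `select_loudspeakers`, the string labels for activities and zones, the `n_closest` parameter, and the fallback behavior for other combinations are all assumptions, and learned user preferences are not modeled.

```python
from typing import Dict, List


def select_loudspeakers(
    activity: str,
    zone: str,
    distances: Dict[str, float],
    n_closest: int = 1,
) -> List[str]:
    """Choose loudspeakers for rendering, ordered nearest-first.

    `distances` maps a loudspeaker identifier to its distance from the
    talker (units are arbitrary in this sketch).
    """
    by_distance = sorted(distances, key=lambda spk: distances[spk])
    if activity == "movie" and zone == "living room":
        # Cinema-like experience: use as many loudspeakers as possible.
        return by_distance
    if activity in ("music", "podcast") and zone in ("kitchen", "bedroom"):
        # More intimate experience: use only the closest loudspeaker(s).
        return by_distance[:n_closest]
    # Fallback for other cases is an assumption, not from the disclosure.
    return by_distance
```

For instance, with hypothetical distances `{"a": 3.0, "b": 1.0, "c": 2.0}`, music in the kitchen would be rendered on loudspeaker `"b"` alone, whereas a movie in the living room would use all three.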

Output 201B may be indicative of a sorted list of some or all of microphones 205 for use by subsystem 204 (i.e., microphone(s) whose output(s) should not be ignored, and instead should be used (i.e., processed) by subsystem 204), and the geometric location of each such microphone with respect to the user (talker). In some embodiments, subsystem 204 may process outputs of some or all of microphones 205 in a manner determined by one or more of: distance of each microphone from the talker (as indicated by output 201B); wakeword score for each microphone (i.e., likelihood that the microphone heard a wakeword uttered by the user) if available; signal to noise ratio of each microphone (i.e., how much louder is speech uttered by the talker with respect to environmental noise and/or audio playback captured from the microphone); or a combination of two or more of the foregoing. The wakeword scores and signal to noise ratios may be calculated by preprocessing subsystem 204. In some applications, such as a phone call, subsystem 204 may only use the output of a best one of microphones 205 (as indicated by the list), or may implement beam forming with signals from a plurality of microphones from the list. To implement some applications, such as (for example) a distributed speech recognizer or a distributed wakeword detector, subsystem 204 may use outputs of a plurality of the microphones 205 (e.g., determined from a sorted list indicated by output 201B, where the sorting may be, for example, in order of proximity to the user).
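A minimal Python sketch of the microphone ordering and selection described above follows. The class `MicInfo`, its field names, and the particular tie-breaking order in `best_microphone` (wakeword score, then signal to noise ratio, then proximity) are hypothetical; the disclosure states only that these quantities, or a combination of them, may determine how microphone outputs are processed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MicInfo:
    """Per-microphone quantities (hypothetical field names and units)."""
    mic_id: str
    distance_m: float                       # distance from the talker
    snr_db: float                           # speech level relative to noise/playback
    wakeword_score: Optional[float] = None  # may be unavailable


def sort_by_proximity(mics: List[MicInfo]) -> List[MicInfo]:
    """Return microphones ordered nearest-first, as in a sorted list 201B."""
    return sorted(mics, key=lambda m: m.distance_m)


def best_microphone(mics: List[MicInfo]) -> MicInfo:
    """Pick one microphone, e.g., for a phone call.

    The ranking used here is an assumed combination: highest wakeword
    score (treated as 0.0 when unavailable), then highest SNR, then
    smallest distance.
    """
    return max(
        mics,
        key=lambda m: (
            m.wakeword_score if m.wakeword_score is not None else 0.0,
            m.snr_db,
            -m.distance_m,
        ),
    )
```

A beamforming or distributed-recognition application would instead take a plurality of microphones from the front of the proximity-sorted list rather than a single best one.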

In some exemplary applications, subsystem 204 (with modules 201 and 202) implements a microphone selection or adaptive beamforming scheme that attempts to pick up sound from the zone of the user more effectively (e.g., in order to better recognize a command that follows a wakeword), using (i.e., at least partially in response to) output 201B. In such scenarios, module 202 may use output 204A of subsystem 204 as feedback regarding the quality of user zone prediction to improve user zone determination in any of various ways, including (but not limited to) the following:

penalizing predictions that result in misrecognition of a voice command following a wakeword. For example, a user zone prediction that results in the user cutting short a voice assistant's response to a command (e.g., by uttering a counter-command such as “Amanda, stop!”) may be penalized;

penalizing predictions that result in low confidence that a speech recognizer (implemented by subsystem 204) has successfully recognized a command;

penalizing predictions that result in failure of a second-pass wakeword detector (implemented by subsystem 204) to retrospectively detect a wakeword with high confidence; and/or

reinforcing predictions that result in highly confident recognition of a wakeword and/or correct recognition of a user voice command.
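The penalize/reinforce feedback listed above can be sketched as a bounded score update per user zone. The event names, the reward values, the step size, and the idea of a per-zone score in the range 0 to 1 are all hypothetical; the sketch only illustrates the direction of each adjustment.

```python
# Hypothetical feedback events and rewards (negative = penalize, positive = reinforce).
FEEDBACK_REWARD = {
    "user_interrupted_response": -1.0,    # e.g., a counter-command cut the response short
    "low_recognizer_confidence": -1.0,
    "second_pass_wakeword_failed": -1.0,
    "confident_recognition": +1.0,
}


def apply_feedback(zone_scores, predicted_zone, event, step=0.1):
    """Return a copy of zone_scores with the predicted zone nudged by the feedback."""
    updated = dict(zone_scores)           # leave the caller's mapping untouched
    new_score = updated.get(predicted_zone, 0.5) + step * FEEDBACK_REWARD[event]
    updated[predicted_zone] = min(1.0, max(0.0, new_score))  # clamp to [0, 1]
    return updated


scores = {"kitchen": 0.5, "living room": 0.5}
after_interrupt = apply_feedback(scores, "kitchen", "user_interrupted_response")
after_success = apply_feedback(scores, "living room", "confident_recognition")
```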

FIG. 3 is a block diagram of elements of an exemplary embodiment of module 201 of FIG. 2. In FIG. 3, the labeled elements are:

elements of the FIG. 2 system (labeled identically in FIGS. 2 and 3);

304: module which is coupled and configured to recognize at least one specific type of voice command 401, and to assert an indication to module 303 (in response to recognizing that a voice command 401 is of a specific recognized type);

303: module which is coupled and configured to generate output signals 201A and 201B (or only one of signal 201A or signal 201B, in some implementations); and

401: voice command(s) from talker.

In the FIG. 3 embodiment, “follow me” module 201 is configured to operate as follows. In response to voice command 401 from the talker (e.g., “Amanda, move the call here” uttered while subsystem 206 is implementing a phone call), module 201 determines a changed set of loudspeakers (indicated by output 201A) and/or microphones (indicated by output 201B) for renderer 207 and/or subsystem 204 to use accordingly.

With module 201 implemented as in FIG. 3, user location module 202 or subsystem 204 (both shown in FIG. 2) may be or include a simple command and control module that recognizes commands from the talker's direct local voice (i.e., microphone signal(s) 204A provided to module 202 from subsystem 204 is/are indicative of such local voice, or command 401 is provided to module 202 as well as to module 201). For example, preprocessing subsystem 204 of FIG. 2 may contain a simple command and control module coupled and configured to recognize voice commands (indicated by output(s) of one or more of microphones 205), and to provide output 401 (indicative of such commands) to module 202 as well as to module 201.

In an example of the FIG. 3 implementation of module 201, the module 201 is configured to respond to a voice command 401 from the talker (e.g., “move the call here”), including by:

knowing the talker's location (indicated by input 202A) as a result of zone mapping, to instruct renderer 207 in accordance with the current talker acoustic zone information (indicated by output 201A), so the renderer can change its rendering configuration to use the best loudspeaker(s) for the talker's current acoustic zone; and/or

knowing the talker's location (indicated by input 202A) as a result of zone mapping, to instruct preprocessing subsystem 204 to use outputs of only the best microphone(s) in accordance with the current talker acoustic zone information (indicated by output 201B).

In an example of the FIG. 3 implementation of module 201, the module 201 is configured to operate as follows:

1. Wait for a voice command (401);

2. Upon receiving a voice command 401, determine (in module 304) whether the received command 401 is of a predetermined specific type (e.g., is one of: “move [activity] here,” or “follow me,” where “[activity]” here denotes any of the activities currently being implemented by the system (e.g., by subsystem 206));

3. If the voice command is not of the specific type, ignore the voice command (so that output signal 201A and/or output signal 201B are generated by module 303 as if the voice command being ignored was not received); and

4. If the voice command is of the specific type, generate (in module 303) output signal 201A and/or output signal 201B to instruct other elements of the system to change their processing according to the current acoustic zone (as detected by user location module 202 and indicated by input 202A).
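The four steps above can be sketched in Python as follows. The regular expression, the dictionary layout of the zone map, and the assumption that any wakeword has already been stripped from the command text are illustrative choices only, as are the function names.

```python
import re


def is_follow_me_command(command, active_activities):
    """Step 2: is the command 'follow me' or 'move [activity] here'?"""
    text = command.lower().strip().rstrip(".!")
    if text == "follow me":
        return True
    match = re.fullmatch(r"move (?:the )?(\w+) here", text)
    # The named activity must be one the system is currently implementing.
    return bool(match) and match.group(1) in active_activities


def handle_command(command, active_activities, current_zone, zone_map, outputs):
    """Steps 3 and 4: ignore other commands, else retarget to the current zone."""
    if not is_follow_me_command(command, active_activities):
        return outputs                       # step 3: outputs are left unchanged
    devices = zone_map[current_zone]         # step 4: devices of the talker's zone
    return {"201A": devices["loudspeakers"], "201B": devices["microphones"]}


# Hypothetical zone map: devices grouped by acoustic zone.
zone_map = {
    "kitchen": {"loudspeakers": ["kitchen_speaker"], "microphones": ["kitchen_mic"]},
    "living room": {"loudspeakers": ["tv", "shelf"], "microphones": ["sofa_mic"]},
}
outputs = {"201A": ["tv", "shelf"], "201B": ["sofa_mic"]}
moved = handle_command("Move the call here", {"call"}, "kitchen", zone_map, outputs)
ignored = handle_command("What time is it", {"call"}, "kitchen", zone_map, outputs)
```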

FIG. 4 is a block diagram of another exemplary embodiment (labeled 300 in FIG. 4) of module 201 of FIG. 2, and operation thereof. In FIG. 4, the labeled elements are:

300: “Follow me” module;

elements of the FIG. 2 system (labeled identically in FIGS. 2 and 4);

elements 303 and 304 of module 300 (labeled as are the corresponding elements of module 201 of FIG. 3);

301: Database of data indicative of learned preferences from a talker's (e.g., the user's) past experiences. Database 301 may be implemented as a memory which stores the data in a non-transitory manner;

301A: Information (data) from database 301 regarding learned preferences from the talker's past experiences;

302: Learning module, coupled and configured to update database 301 in response to one or more of inputs 401 and/or 206A, and/or one or both of outputs 201A and 201B (generated by module 303);

302A: Updated information (data) about the talker's preferences (generated by module 302 and provided to database 301 for storage therein);

306: module coupled and configured to assess confidence in a determined talker location;

307: module coupled and configured to assess whether a determined talker location is a new location; and

308: module coupled and configured to request a user confirmation (e.g., confirmation of the user's location).

Follow me module 300 of FIG. 4 implements an extension to the exemplary embodiment of follow me module 201 of FIG. 3, in that module 300 is configured to make automatic decisions about the best loudspeaker(s) and microphone(s) to use based on the talker's past experiences.

With module 201 of FIG. 2 implemented as module 300 of FIG. 4, preprocessing subsystem 204 of FIG. 2 may include a simple command and control module coupled and configured to recognize voice commands (indicated by output(s) of one or more of microphones 205), and to provide output 401 (indicative of recognized commands) to both module 202 and module 300. More generally, user location module 202 or subsystem 204 (both shown in FIG. 2) may be or implement a command and control module which is configured to recognize commands from the talker's direct local voice (e.g., microphone signal(s) 204A provided to module 202 from subsystem 204 is/are indicative of such local voice, or a recognized voice command 401 is provided from subsystem 204 to module 202 as well as to module 300), and module 202 is configured to use the recognized commands to automatically detect the talker's location.

In the FIG. 4 embodiment, module 202 together with zone map 203 may implement an acoustic zone mapper (module 202 may be coupled and configured to operate with zone map 203, or may be integrated with zone map 203). In some implementations, the zone mapper may use the outputs of Bluetooth devices or other radio frequency beacons to determine the talker's location within the zones. In some implementations, the zone mapper may keep historical information in its own system, and generate output 202A (for provision to module 300 of FIG. 4, or to another embodiment of module 201 of FIG. 2) to be indicative of a probabilistic confidence in the talker's location. The probability that the talker's location has been determined correctly may be used by module 306 (of module 300) to influence the acuity of the loudspeaker renderer (e.g., to cause output 201A to in turn cause renderer 207 to render the relevant audio in a more focused way if module 306 is sufficiently confident about the talker location, e.g., because module 300 has seen other instances, indicated by data 301A, of a talker speaking from that location). Conversely, if module 300 does not recognize that the talker has previously been located in a particular position, and module 306 has insufficient confidence in the talker location (e.g., confidence below a predetermined threshold), module 306 may cause output 201A to be generated so as to cause renderer 207 to render the relevant audio to be perceived in a more general vicinity.

In the FIG. 4 implementation, a command 401 from the talker may cause module 300 to generate output 201A and/or output 201B to indicate a new set of current loudspeakers and/or microphones, and thus override the current loudspeakers and/or microphones in use, e.g., as in the exemplary embodiment of FIG. 3. Depending on the talker's location within the acoustic zones (e.g., as indicated by input 202A), the confidence (as determined by module 306) that the talker is in fact within a determined zone, the activities currently ongoing (i.e., those being implemented by subsystem 206 of FIG. 2, e.g., as indicated by input 206A), and past learned experiences (e.g., as indicated by data 301A), module 300 is configured to make an automatic decision to change the currently used loudspeakers and/or microphones for a determined ongoing activity. In some implementations, if the system is not sufficiently confident about such an automatic decision (e.g., if module 306 has confidence in a determined talker location that does not exceed a predetermined threshold), it may issue a request (e.g., module 306 may cause module 308 to cause output 201A to cause issuance of a request) for confirmation of the location from the talker. This request may be in the form of a voice prompt from the loudspeaker closest to the talker (for example the prompt “we've noticed you moved to the Kitchen, would you like to play music here?”).
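The confidence gating performed by modules 306 and 308 can be sketched as a single threshold test. The `Decision` type, the threshold value of 0.7, and the exact prompt wording are assumptions made for illustration; the description only requires that low confidence leads to more diffuse rendering and, in some implementations, a confirmation request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    move_activity: bool      # change loudspeakers/microphones automatically?
    rendering: str           # "focused" or "diffuse"
    prompt: Optional[str]    # confirmation request, if one should be issued


def decide(zone, activity, confidence, threshold=0.7):
    """Hypothetical gate: act automatically only when confidence exceeds threshold."""
    if confidence > threshold:
        return Decision(move_activity=True, rendering="focused", prompt=None)
    prompt = f"We've noticed you moved to the {zone}, would you like to {activity} here?"
    return Decision(move_activity=False, rendering="diffuse", prompt=prompt)


confident = decide("Kitchen", "play music", confidence=0.9)
uncertain = decide("Kitchen", "play music", confidence=0.4)
```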

Module 300 of FIG. 4 is configured to make automatic decisions regarding the configuration of renderer 207 and which microphone(s) the subsystem 204 should use based on the talker's movements within acoustic zones, and optionally on past experiences (indicated by data in database 301). To do so, module 300 may take into account input (e.g., command(s) 401) from the above-mentioned command and control module (implemented by preprocessing subsystem 204 or module 202) indicative of commands indicated by the talker's direct local voice, as well as information (e.g., input 202A generated by module 202) which indicates the talker's location.

After a decision is made by module 300 (i.e., to generate output 201A and/or output 201B to cause a change in a previously determined set of loudspeakers and/or microphones), learning module 302 may store data 302A into database 301, where data 302A may indicate whether the decision was satisfactory (e.g., the talker didn't manually override the decision) or unsatisfactory (e.g., the talker manually overrode the decision by issuing a voice command), in an effort to ensure a better automatically determined outcome in the future.

More generally, generation (e.g., updating) of output 201A and/or output 201B may be performed at the time of an ongoing audio activity in response to data (e.g., from database 301) indicative of learned experiences (e.g., learned preferences of a user) determined by learning module 302 (and/or another learning module of an embodiment of the inventive system) from at least one previous activity (which occurred before the generation of the outputs 201A and/or 201B, e.g., before the ongoing audio activity). For example, the learned experiences may be determined from previous user commands asserted under conditions which were the same or similar to those present during the current, ongoing audio activity, and output 201A and/or output 201B may be updated in accordance with a probabilistic confidence based on data (e.g., from database 301) indicative of such learned experiences (e.g., to influence the acuity of loudspeaker renderer 207 in the sense that the updated output 201A causes renderer 207 to render the relevant audio in a more focused way if module 300 is sufficiently confident about the user's preference based on the learned experiences).

Learning module 302 may implement a simple database of the most recent correct decision made in response to (and/or having) each set of the same inputs (provided to module 300) and/or features. Inputs to this database may be or include current system activity (e.g., indicated by input 206A), current talker acoustic zone (indicated by input 202A), previous talker acoustic zone (also indicated by input 202A), and an indication (e.g., indicated by a voice command 401) as to whether a previous decision in the same situation was correct. Alternatively, module 302 can implement a state map with probabilities that the talker wants to change the state of the system automatically, with each past decision, correct and incorrect, being added to such probability map. Alternatively, module 302 may be implemented as a neural network that learns based on all, or some of, the inputs of module 300, with its output being used to generate outputs 201A and 201B (e.g., to instruct renderer 207 and preprocessing subsystem 204 whether a zone change is required or not).
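The state-map alternative can be sketched as a table of counts keyed by activity, previous zone, and current zone. The class name and the use of Laplace (add-one) smoothing to turn counts into a probability are assumptions introduced here; the description does not specify how the probabilities are computed.

```python
from collections import defaultdict


class PreferenceMap:
    """Hypothetical state map: counts of correct and incorrect automatic decisions."""

    def __init__(self):
        # key -> [number of correct decisions, number of incorrect decisions]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, activity, previous_zone, current_zone, was_correct):
        """Add one past decision, correct or incorrect, to the map."""
        counts = self._counts[(activity, previous_zone, current_zone)]
        counts[0 if was_correct else 1] += 1

    def probability(self, activity, previous_zone, current_zone):
        """Laplace-smoothed estimate that the talker wants the automatic change."""
        correct, incorrect = self._counts.get(
            (activity, previous_zone, current_zone), (0, 0)
        )
        return (correct + 1) / (correct + incorrect + 2)


prefs = PreferenceMap()
unseen = prefs.probability("call", "zone 1", "zone 2")    # no history yet
for _ in range(3):
    prefs.record("call", "zone 1", "zone 2", was_correct=True)
learned = prefs.probability("call", "zone 1", "zone 2")
prefs.record("call", "zone 1", "zone 2", was_correct=False)  # talker overrode once
after_override = prefs.probability("call", "zone 1", "zone 2")
```

The resulting probability could then be compared against a threshold of the kind applied by module 306.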

An example flow of the processing performed by the FIG. 2 system (with module 201 implemented as module 300 of FIG. 4) is as follows:

1. A talker is in acoustic zone 1 (e.g., element 107 of FIG. 1A) and starts a phone call with Anthony;

2. User location module 202 and follow me module 300 know the talker is in zone 1, and module 300 generates outputs 201A and 201B to cause preprocessing subsystem 204 to use a best microphone (or microphones) for that zone, and causes renderer 207 to use a best loudspeaker configuration for that zone;

3. The talker moves to acoustic zone 2 (e.g., element 112 of FIG. 1B);

4. User location module 202 detects a change in the talker's acoustic zone and asserts input 202A to module 300 so as to indicate the change;

5. Module 300 remembers from past experiences (i.e., data in database 301 indicates) that the talker asked to move a phone call to the new acoustic zone upon talker movement in circumstances like the current one. After a short amount of time, confidence that the call should be moved goes above a set threshold (as determined by module 306), and module 300 instructs preprocessing subsystem 204 to change the microphone configuration to the new acoustic zone, and also instructs renderer 207 to adjust its loudspeaker configuration to provide the best experience for the new acoustic zone; and

6. The talker does not override the automatic decision by uttering a voice command 401 (so that module 304 does not indicate such an override to learning module 302 and to module 303), and learning module 302 causes data 302A to be stored in database 301 to indicate that module 300 made the correct decision in this case, reinforcing such decision for similar future cases.

FIG. 5 is a block diagram of another exemplary embodiment of the inventive system. In FIG. 5, the labeled elements are:

elements of the FIG. 2 system (labeled identically in FIGS. 2 and 5);

211: Activity Manager, which is coupled to subsystem 206 and module 201, and has knowledge of the talker's activities in and beyond the environment (e.g., a home) in which the system is implemented;

212: a smartphone (of the system's user, who is sometimes referred to herein as a talker) coupled to activity manager 211, and a Bluetooth headset connected to the smartphone; and

206B: Information (data) about the current ongoing activity or activities implemented by subsystem 206 (and/or activities of the talker beyond the environment in which the system is implemented), which is generated by activity manager 211 and/or subsystem 206, and provided as an input to module 201.

In the FIG. 5 system, outputs 201A and 201B of “follow me” module 201 are instructions to activity manager 211, as well as to renderer 207 and preprocessing subsystem 204, which may cause each of them to adapt processing according to the talker's current acoustic zone (e.g., a new acoustic zone in which the talker is determined to be located).

In the FIG. 5 system, module 201 is configured to generate output 201A and/or output 201B in response to input 206B (and the other inputs provided to module 201). Output 201A of module 201 instructs renderer 207 (and/or activity manager 211) to adapt processing according to the current (e.g., newly determined) acoustic zone of the talker. Output 201B of module 201 instructs preprocessing subsystem 204 (and/or activity manager 211) to adapt processing according to the current (e.g., newly determined) acoustic zone of the talker.

An example flow of the processing implemented by the FIG. 5 system assumes that the system is implemented in a house, except that element 212 may operate either within or outside the house, and that module 201 is implemented as is module 300 of FIG. 4. The example flow is as follows:

1. The talker is out of the house for a walk and receives a phone call from Anthony on smartphone element 212;

2. The talker walks into the house, into acoustic zone 1 (e.g., element 107 of FIG. 1A), in the middle of the call and turns off the Bluetooth headset of element 212;

3. User location module 202 and module 201 detect that the talker has entered acoustic zone 1, and module 201 knows (from input 206B) that the talker is in a phone call (being implemented by subsystem 206) and that the Bluetooth headset of element 212 has been turned off;

4. Module 201 remembers from past experiences that the talker asked to move the call to the new acoustic zone in circumstances similar to the current one. After a short amount of time, confidence that the call should be moved rises above a threshold, and module 201 instructs activity manager 211 (by asserting appropriate output(s) 201A and/or 201B) that the call should be moved from smartphone element 212 to the devices of the FIG. 5 system which are implemented in the home; module 201 instructs preprocessing subsystem 204 (by asserting an appropriate output 201B) to change the microphone configuration to the new acoustic zone; and module 201 also instructs renderer 207 (by asserting an appropriate output 201A) to adjust its loudspeaker configuration to provide the best experience for the new acoustic zone; and

5. The talker does not override the automatic decision (made by module 201) by uttering a voice command, and the learning module (302) of module 201 stores data indicating that module 201 made the correct decision in this case, for use in reinforcing such decision for similar future cases.

Other embodiments of the inventive method are:

a method of controlling a system including a plurality of smart audio devices in an environment, wherein the system includes a set of one or more microphones (e.g., each of the microphones is included in or configured for communication with at least one of the smart audio devices in the environment) and a set of one or more loudspeakers, and wherein the environment includes a plurality of user zones, the method including steps of: determining an estimate of a user's location in the environment at least in part from output signals of the microphones, wherein the estimate is indicative of in which one of the user zones the user is located;

a method of managing an audio session across a plurality of smart audio devices, including a step of changing a set of currently used microphones and loudspeakers for an ongoing audio activity in response to a user's request or other sound uttered by the user; and

a method of managing an audio session across a plurality of smart audio devices, including a step of changing a set of currently used microphones and loudspeakers for an ongoing audio activity based on at least one previous experience (e.g., based on at least one learned preference from a user's past experiences).

Examples of embodiments of the invention include (but are not limited to) the following:

X1. A method, including steps of:

performing at least one audio activity using a loudspeaker set of a system implemented in an environment, where the system includes at least two microphones and at least two loudspeakers, and the loudspeaker set includes at least one of the loudspeakers;

determining an estimated location of a user in the environment in response to sound uttered by the user, wherein the sound uttered by the user is detected by at least one of the microphones of the system; and

controlling the audio activity in response to determining the estimated location of the user, including by at least one of:

controlling at least one setting or state of the loudspeaker set; or

causing the audio activity to be performed using a modified loudspeaker set, where the modified loudspeaker set includes at least one loudspeaker of the system, but where the modified loudspeaker set is different than the loudspeaker set.

X2. The method of X1, wherein the sound uttered by the user is a voice command.

X3. The method of X1 or X2, wherein the audio activity is conducting a phone call or playing audio content while detecting sound using at least one microphone of the system.

X4. The method of X1, X2, or X3, wherein at least some of the microphones and at least some of the loudspeakers of the system are implemented in or coupled to smart audio devices.

X5. The method of X1, X2, X3, or X4, wherein the step of controlling the audio activity is performed in response to determining the estimated location of the user and in response to at least one learned experience.

X6. The method of X5, wherein the system includes at least one learning module, and also including a step of:

before the controlling step, generating and storing data indicative of the learned experience using the at least one learning module.

X7. The method of X6, wherein the step of generating data indicative of the learned experience includes recognizing at least one voice command uttered by the user.

X8. A method, including steps of:

performing at least one audio activity using a transducer set of a system implemented in an environment, where the transducer set includes at least one microphone and at least one loudspeaker, the environment has zones, and the zones are indicated by a zone map; and

determining an estimated location of a user in response to sound uttered by the user, including by detecting the sound uttered by the user using at least one microphone of the system and estimating in which of the zones the user is located.

X9. The method of X8, wherein the sound uttered by the user is a voice command.

X10. The method of X8 or X9, wherein the audio activity is conducting a phone call or playing audio content while detecting sound using at least one microphone of the system.

X11. The method of X8, X9, or X10, wherein the transducer set includes microphones and loudspeakers, and at least some of the microphones and at least some of the loudspeakers are implemented in or coupled to smart audio devices.

X12. The method of X8, X9, X10, or X11, also including:

controlling the audio activity in response to determining the estimated location of the user, including by at least one of:

controlling at least one setting or state of the transducer set; or

causing the audio activity to be performed using a modified transducer set, where the modified transducer set includes at least one microphone and at least one loudspeaker of the system, but where the modified transducer set is different than the transducer set.

X13. The method of X12, wherein the step of controlling the audio activity is performed in response to determining the estimated location of the user and in response to at least one learned experience.

X14. The method of X12 or X13, wherein the system includes at least one learning module, and also including a step of:

before the controlling step, generating and storing data indicative of the learned experience using the at least one learning module.

X15. The method of X14, wherein the step of generating data indicative of the learned experience includes recognizing at least one voice command uttered by the user.

X16. A computer readable medium which stores in a non-transitory manner code for performing the method of X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, or X15, or steps of any said method.

X17. A system for controlling at least one audio activity in an environment, wherein the audio activity uses at least two microphones and at least two loudspeakers in the environment, said system including:

a user location and activity control subsystem coupled and configured to determine an estimated location of a user in the environment in response to sound uttered by the user and detected by at least one of the microphones, and to control the audio activity in response to determining the estimated location of the user, wherein said control is or includes at least one of:

controlling at least one setting or state of a loudspeaker set, wherein the loudspeaker set includes at least one of the loudspeakers; or

causing the audio activity to be performed using a modified loudspeaker set, where the modified loudspeaker set includes at least one loudspeaker of the system, but where the modified loudspeaker set is different than the loudspeaker set.

X18. The system of X17, wherein the sound uttered by the user is a voice command.

X19. The system of X17 or X18, wherein the audio activity is or includes playing audio content while detecting sound using at least one of said at least two microphones, or conducting a phone call.

X20. The system of X17, X18, or X19, wherein at least some of the microphones and at least some of the loudspeakers are implemented in or coupled to smart audio devices.

X21. The system of X17, X18, X19, or X20, wherein the user location and activity control subsystem is configured to control the audio activity in response to determining the estimated location of the user and in response to at least one learned experience.

X22. The system of X21, wherein the system is configured to generate and store data indicative of the learned experience, including by recognizing at least one voice command uttered by the user.

X23. A system for determining user location during performance of at least one audio activity in an environment using a transducer set, wherein the environment has zones which are indicated by a zone map, the environment includes at least two microphones and at least two loudspeakers, and the transducer set includes at least one of the microphones and at least one of the loudspeakers, said system including:

a user location subsystem coupled and configured to determine an estimated location of a user in the environment in response to sound uttered by the user and detected using at least one of the microphones, including by estimating in which of the zones the user is located.

X24. The system of X23, wherein the sound uttered by the user is a voice command.

X25. The system of X23 or X24, wherein the audio activity is conducting a phone call or playing audio content while detecting sound using at least one of the microphones.

X26. The system of X23, X24, or X25, wherein at least some of the microphones and at least some of the loudspeakers are implemented in or coupled to smart audio devices.

X27. The system of X23, X24, X25, or X26, wherein the user location subsystem is a user location and activity control subsystem which is coupled and configured to control the audio activity in response to determining the estimated location of the user, wherein said control is or includes at least one of:

controlling at least one setting or state of the transducer set; or

causing the audio activity to be performed using a modified transducer set, where the modified transducer set includes at least one of the microphones and at least one of the loudspeakers, but where the modified transducer set is different than the transducer set.

X28. The system of X23, X24, X25, X26, or X27, wherein the user location and activity control subsystem is coupled and configured to control the audio activity in response to determining the estimated location of the user and in response to at least one learned experience.

X29. The system of X28, wherein the system is configured to generate and store data indicative of the learned experience, including by recognizing at least one voice command uttered by the user.

Aspects of the invention include a system or device configured (e.g., programmed) to perform any embodiment of the inventive method, and a tangible computer readable medium (e.g., a disc) which stores code for implementing any embodiment of the inventive method or steps thereof. For example, the inventive system can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the inventive method or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform an embodiment of the inventive method (or steps thereof) in response to data asserted thereto.

Some embodiments of the inventive system are implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of an embodiment of the inventive method. Alternatively, embodiments of the inventive system (or elements thereof) are implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including an embodiment of the inventive method. Alternatively, elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform an embodiment of the inventive method, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform an embodiment of the inventive method would typically be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.

Another aspect of the invention is a tangible, non-transitory, computer readable medium (for example, a disc or other tangible storage medium) which stores in a non-transitory manner code for performing (e.g., code executable to perform) any embodiment of the inventive method or steps thereof. For example, elements 201, 202, and 203 (of the system of FIG. 2 or FIG. 5) may be implemented by a DSP (e.g., implemented in a smart audio device or other audio device) or general purpose processor, where the DSP or general purpose processor is programmed to perform an embodiment of the inventive method or steps thereof, and the general purpose processor or DSP (or another element of the system) may include a computer readable medium which stores in a non-transitory manner code for performing said embodiment of the inventive method or steps thereof.

While specific embodiments of the present invention and applications of the invention have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the invention described and claimed herein. It should be understood that while certain forms of the invention have been shown and described, the invention is not to be limited to the specific embodiments described and shown or the specific methods described. 

What is claimed is:
 1. A method, including steps of: performing at least one audio activity using a loudspeaker set of a system implemented in an environment, wherein the system includes at least two microphones and at least two loudspeakers, the loudspeaker set includes at least one of the loudspeakers, the environment has zones, the zones are indicated by a zone map, and the system stores data indicative of the zone map; determining an estimated location of a user in the environment in response to sound uttered by the user, including by detecting the sound uttered by the user using at least one of the microphones of the system and estimating in which of the zones the user is located using the data indicative of the zone map; and controlling the audio activity in response to determining the estimated location of the user, including by at least one of: controlling at least one setting or state of the loudspeaker set; or causing the audio activity to be performed using a modified loudspeaker set, wherein the modified loudspeaker set includes at least one loudspeaker of the system, and wherein the modified loudspeaker set is different than the loudspeaker set, wherein the data indicative of the zone map includes a list of the at least two microphones and the at least two loudspeakers of the system grouped by location in the zones.
 2. The method of claim 1, wherein the sound uttered by the user is a voice command.
 3. The method of claim 1, wherein the audio activity is conducting a phone call or playing audio content while detecting sound using at least one microphone of the system.
 4. The method of claim 1, wherein at least some of the microphones and at least some of the loudspeakers of the system are implemented in or coupled to smart audio devices.
 5. The method of claim 1, wherein the step of controlling the audio activity is performed in response to determining the estimated location of the user and in response to at least one learned experience.
 6. The method of claim 5, wherein the system includes at least one learning module, and also including a step of: before the controlling step, generating and storing data indicative of the learned experience using the at least one learning module.
 7. The method of claim 6, wherein the step of generating data indicative of the learned experience includes recognizing at least one voice command uttered by the user.
 8. A computer readable medium which stores in a non-transitory manner code for performing the method of claim 1 or steps of said method.
 9. A system for controlling at least one audio activity in an environment, wherein the audio activity uses at least two microphones and at least two loudspeakers in the environment, the environment has zones, the zones are indicated by a zone map, and the system stores data indicative of the zone map, said system including: a user location and activity control subsystem coupled and configured to determine an estimated location of a user in the environment in response to sound uttered by the user, including by detecting the sound uttered by the user using at least one of the microphones of the system and estimating in which of the zones the user is located using the data indicative of the zone map, and to control the audio activity in response to determining the estimated location of the user, wherein said control is or includes at least one of: controlling at least one setting or state of a loudspeaker set, wherein the loudspeaker set includes at least one of the loudspeakers; or causing the audio activity to be performed using a modified loudspeaker set, wherein the modified loudspeaker set includes at least one loudspeaker of the system, and wherein the modified loudspeaker set is different than the loudspeaker set, wherein the data indicative of the zone map includes a list of the at least two microphones and the at least two loudspeakers of the system grouped by location in the zones.
 10. The system of claim 9, wherein the sound uttered by the user is a voice command.
 11. The system of claim 9, wherein the audio activity is or includes playing audio content while detecting sound using at least one of said at least two microphones, or conducting a phone call.
 12. The system of claim 9, wherein at least some of the microphones and at least some of the loudspeakers are implemented in or coupled to smart audio devices.
 13. The system of claim 9, wherein the user location and activity control subsystem is configured to control the audio activity in response to determining the estimated location of the user and in response to at least one learned experience.
 14. The system of claim 13, wherein the system is configured to generate and store data indicative of the learned experience, including by recognizing at least one voice command uttered by the user.
 15. The method of claim 1, wherein the data indicative of the zone map includes a list of devices of the system located within each of the zones.
 16. The method of claim 1, wherein the data indicative of the zone map includes a list of devices of the system located within each of the zones and a geometric location of each device of the system.
 17. The method of claim 1, wherein the data indicative of the zone map includes a list of devices of the system located within each of the zones, a geometric location of each device of the system, and a dimension of each of the zones.
 18. The method of claim 1, wherein the data indicative of the zone map includes a list of devices of the system located within each of the zones, a geometric location of each device of the system, and a geometric location of each of the zones with respect to each other of the zones.
 19. The method of claim 1, wherein determining the estimated location of the user uses data regarding how long the user has been in a given zone.
 20. The method of claim 1, wherein determining the estimated location of the user uses data regarding a history of user movement.
 21. The method of claim 1, further comprising: detecting a wakeword in the sound uttered by the user; and entering, by a device of the system, an awakened state in response to detecting the wakeword.
 22. The method of claim 21, wherein detecting the wakeword includes: detecting the wakeword by a plurality of microphones of the system; and calculating a wakeword score for each of the plurality of microphones in response to detecting the wakeword.
 23. The method of claim 6, wherein the learning module implements a state map.
 24. The method of claim 6, wherein the learning module implements a neural network.
 25. The system of claim 9, wherein the data indicative of the zone map includes a list of devices of the system located within each of the zones.
 26. The system of claim 9, wherein the data indicative of the zone map includes a list of devices of the system located within each of the zones and a geometric location of each device of the system.
 27. The system of claim 9, wherein the data indicative of the zone map includes a list of devices of the system located within each of the zones, a geometric location of each device of the system, and a dimension of each of the zones.
 28. The system of claim 9, wherein the data indicative of the zone map includes a list of devices of the system located within each of the zones, a geometric location of each device of the system, and a geometric location of each of the zones with respect to each other of the zones.
 29. The system of claim 9, wherein said control further includes: detecting a wakeword in the sound uttered by the user; and entering, by a device of the system, an awakened state in response to detecting the wakeword.
 30. The system of claim 29, wherein detecting the wakeword includes: detecting the wakeword by a plurality of microphones of the system; and calculating a wakeword score for each of the plurality of microphones in response to detecting the wakeword. 
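
By way of illustration only, and without limiting the claims, the following minimal sketch shows one way in which data indicative of a zone map (microphones and loudspeakers grouped by zone) might be stored and used to estimate the zone in which a user is located from per-microphone wakeword scores, and then to select a modified loudspeaker set. The zone names, device names, the names Zone, ZONE_MAP, estimate_zone and select_loudspeaker_set, and the rule that the highest wakeword score indicates the nearest zone are all assumptions made for this example; they are not taken from the embodiments described herein.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Zone:
    """One zone of the zone map: devices grouped by location."""
    microphones: list[str] = field(default_factory=list)
    loudspeakers: list[str] = field(default_factory=list)


# Hypothetical zone map; zone and device names are invented for illustration.
ZONE_MAP = {
    "kitchen": Zone(microphones=["mic_k1"], loudspeakers=["spk_k1"]),
    "living_room": Zone(
        microphones=["mic_l1", "mic_l2"],
        loudspeakers=["spk_l1", "spk_l2"],
    ),
}


def estimate_zone(zone_map, wakeword_scores):
    """Return the name of the zone whose microphones report the highest score.

    Assumption: a higher wakeword score means the user is closer to that
    microphone. Returns None when no mapped microphone reported a score.
    """
    best_zone, best_score = None, float("-inf")
    for name, zone in zone_map.items():
        scores = [wakeword_scores[m] for m in zone.microphones
                  if m in wakeword_scores]
        if scores and max(scores) > best_score:
            best_zone, best_score = name, max(scores)
    return best_zone


def select_loudspeaker_set(zone_map, current_set, zone_name):
    """Return the loudspeakers of the estimated zone.

    Falls back to the current loudspeaker set when the zone is unknown.
    """
    if zone_name is None:
        return list(current_set)
    return list(zone_map[zone_name].loudspeakers)
```

In this sketch, a call such as estimate_zone(ZONE_MAP, {"mic_k1": 0.2, "mic_l2": 0.9}) would select the hypothetical "living_room" zone, and select_loudspeaker_set would then return that zone's loudspeakers as the modified loudspeaker set. A practical system might instead combine scores across microphones, or weight them by how long the user has been in a zone or by a history of user movement.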